From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenario

Authors

  • Héctor Martínez Alonso
  • Djamé Seddah
  • Benoît Sagot
Abstract

User-generated content presents many challenges for automatic processing. While many of them stem from out-of-vocabulary effects, others arise from different linguistic phenomena such as unusual syntax. In this work we present a French three-domain data set made up of question headlines from a cooking forum, together with game chat logs and associated forums from two popular online games (MINECRAFT & LEAGUE OF LEGENDS). We chose these domains because they encompass different degrees of lexical and syntactic compliance with canonical language. We conduct an automatic and manual evaluation of the difficulties of processing these domains for part-of-speech prediction, and introduce a pilot study to determine whether dependency analysis lends itself well to annotating these data. We also discuss the development cost of our data set.

Related resources

Iranian EFL Learners L2 Reading Comprehension: The Effect of Online Annotations via Interactive White Boards

This study explores the effect of online annotations via Interactive White Boards (IWBs) on the reading comprehension of Iranian EFL learners. To this aim, 60 students from a language institute were selected as a homogeneous group based on their performance on the Oxford Placement Test (2014). Then, they were randomly assigned to 3 experimental groups of 20, and subsequently exposed to the research treatment af...

Some Challenges of Automated Annotation in A Multilingual Scenario

A key ingredient of today’s NLP scenario is annotation and this paper discusses challenges involved in one of the toughest annotation tasks which is sense marking. A large amount of data needs to be sense marked accurately by human annotators in order to train the machine to understand the spoken languages. The sense marked corpus for various languages facilitate the task of Word Sense Disambig...

Description Theory and Approach about Space Syntax Methodology in Urban Design Process

Nowadays space is considered to be a net of connections between different areas. How to delineate its character, and how two spaces can be compared objectively, are questions still under consideration. This was the reason for creating new theories about the structure of space. Space syntax analysis (SSA) or simply Space Syntax defines a number of theories and research areas devo...

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

Syntactic annotation of spoken utterances: A case study on the Czech Academic Corpus

Corpus annotation plays an important role in linguistic analysis and computational processing of both written and spoken language. Syntactic annotation of spoken texts is clearly becoming a topic of considerable interest, driven by the desire to improve automatic speech recognition systems by incorporating syntax in the language models, or to build language understanding applications. Synt...

Publication date: 2016